Regulating data storage based on copy quantity

ABSTRACT

In some examples, a server may receive a data file from one or more computing devices, and may store the data file at a storage system provided by a data storage service. The server may determine a number of copies of the data file to be stored at the storage system based on a number of a set of computing devices that store the data file. For example, the set of computing devices may be outside of the storage system, and the determined number of copies of the data file to be stored at the storage system may decrease when the number of the set of computing devices that store the data file increases. Additionally, the server may adjust the number of copies of the data file stored at the storage system based on the determined number of copies of the data file.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is the continuation of U.S. application Ser. No. 14/044,498, filed Oct. 2, 2013, which application claims the benefit of U.S. Provisional Patent Application No. 61/708,794, filed on Oct. 2, 2012, which applications are incorporated by reference herein in their entirety.

TECHNICAL FIELD

Several of the disclosed embodiments relate to data storage, and more particularly, to regulating a number of copies of a data file to be stored in a storage system based on a popularity of the data file.

BACKGROUND

Current storage services such as cloud storage services allow users to store various multi-media content such as music files, video files, images, documents, etc. in the cloud. In order to provide for recovery from data loss, the cloud storage services typically replicate the content and store various copies of the content at different storage systems, and probably at different locations. This requires huge amounts of storage resources and other associated infrastructure and maintenance resources to maintain the data centers. This can result in increased costs. Further, some of the data files stored for various users can be identical. For example, a music file such as “Optimistic” by “Radiohead” is the same for any user storing that music file with a cloud storage service. The current cloud storage services store multiple copies of the same file, e.g., one for each user who uploaded the music file, which results in a significant amount of space being used for storing identical files. Accordingly, the current storage services are inefficient at least in terms of managing the available storage space.

SUMMARY

Technology is disclosed for regulating data storage based on a popularity of data files (“the technology”). Various embodiments of the technology provide for maintaining a fixed durability level of data files stored in a storage system by regulating a number of copies of the data files stored in the storage system. One such embodiment includes regulating the number of copies of a particular data file stored in the storage system based on a popularity of the particular data file among various users who use the storage system. The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity of the particular data file. Further, the storage system can store either a complete data file or a portion of the data file. Accordingly, the technology is applicable to either the complete data file or a portion of the data file.

In some embodiments, the popularity of the particular data file is determined by computing a popularity value for the particular data file. The popularity value of the particular data file can be determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently, etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within a maximum accepted latency, etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity value. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Generally, the higher the popularity of the data file, the lower the number of copies of the data file that need to be stored at the storage system. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.

When a user requests a particular data file, a server determines whether the particular data file is available at the storage system. If the particular data file is available at the storage system, the server serves the request by fetching the file from the storage system. On the other hand, if the particular data file is not available at the storage system, the server serves the request by fetching the file from any of the other computing devices of the user and/or any of the computing devices of other users that contain the particular data file.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an environment where data storage regulation for maintaining a specified durability level for the data files at a storage system can be implemented.

FIG. 2 illustrates an example system that regulates a number of copies of data files stored at a storage system based on a popularity value of the corresponding data file, consistent with various embodiments of the disclosed technology.

FIG. 3 illustrates an example of a system for serving a particular data file from the storage system, consistent with various embodiments of the disclosed technology.

FIG. 4 illustrates a block diagram of a server that regulates the number of copies of the data files stored at the storage system based on the popularity values of the corresponding data files, consistent with various embodiments of the disclosed technology.

FIG. 5 illustrates a flow diagram for regulating data storage at a storage system based on a popularity value of data files, consistent with various embodiments of the disclosed technology.

FIG. 6 illustrates an example process for serving a particular data file from the storage system, consistent with various embodiments of the disclosed technology.

FIG. 7 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology.

DETAILED DESCRIPTION

Technology is disclosed for regulating data storage based on a popularity of data (“the technology”). Various embodiments of the technology provide for maintaining a fixed durability level of data files stored in a storage system by regulating a number of copies of the data files stored in the storage system. One such embodiment includes regulating the number of copies of a particular data file stored in the storage system based on the popularity of the particular data file among various users who use the storage system. The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity of the particular data file. Further, the storage system can store either a complete data file or a portion of the data file. Accordingly, the technology is applicable to either the complete data file or a portion of the data file.

In some embodiments, the popularity of the particular data file is determined by computing a popularity value for the particular data file. The popularity value of the particular data file can be determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently, etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within a maximum accepted latency, etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero copies, based on the popularity value. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Generally, the higher the popularity of the data file, the lower the number of copies of the data file that need to be stored at the storage system. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.

When a user requests a particular data file, a server determines whether the particular data file is available at the storage system. If the particular data file is available at the storage system, the server serves the request by fetching the file from the storage system. On the other hand, if the particular data file is not available at the storage system, the server serves the request by fetching the file from any of the other computing devices of the user and/or any of the computing devices of other users that contain the particular data file.

Environment

FIG. 1 illustrates an environment where data storage regulation for maintaining a specified durability level for the data files at a storage system can be implemented. The system 100 includes a storage system 105 for storing data files received from computing devices 130-140 of users. The system 100 includes a cloud server 110 configured to handle communications between the computing devices 130-140 and the storage system 105. The communications can include data storage or retrieval requests from the computing devices 130-140. In one embodiment, the cloud server 110 can be a server cluster having computer nodes interconnected with each other by a network. The server cluster can communicate with the storage system 105 via the Internet or other communication networks. The storage system 105 contains storage nodes 112. Each of the storage nodes 112 contains one or more processors 114 and storage devices 116. The storage devices 116 can include optical disk storage, RAM, ROM, EEPROM, flash memory, phase change memory, magnetic cassettes, magnetic tapes, magnetic disk storage or any other computer storage medium which can be used to store the desired information.

A cloud data interface 120 can also be included to receive data from and send data to computing devices 130-140. The cloud data interface 120 can include network communication hardware and network connection logic to receive the information from computing devices. The network can be a local area network (LAN), wide area network (WAN) or the Internet. The cloud data interface 120 may include a queuing mechanism to organize data updates received from or sent to the computing devices 130-140.

Although FIG. 1 illustrates two computing devices 130-140, a person having ordinary skill in the art will readily understand that the technology disclosed herein can be applied to a single computing device or more than two computing devices connected to the cloud server 110.

The computing devices 130-140 include an operating system 132-142 to manage the hardware resources of the computing devices 130-140 and provide services for running computer applications 134-144 (e.g., mobile applications running on mobile devices). The operating system 132-142 facilitates execution of the computer applications 134-144 on the computing device 130-140. The computing devices 130-140 include at least one local storage device 138-148 to store the computer applications 134-144 and user data. The computing device 130 or 140 can be a desktop computer, a laptop computer, a tablet computer, an automobile computer, a game console, a smartphone, a personal digital assistant, or other computing devices capable of running computer applications, as contemplated by a person having ordinary skill in the art.

The computer applications 134-144 stored in the computing devices 130-140 can include applications for general productivity and information retrieval, including email, calendar, contacts, and stock market and weather information. The computer applications 134-144 can also include applications in other categories, such as mobile games, factory automation, GPS and location-based services, banking, order-tracking, ticket purchases or any other categories as contemplated by a person having ordinary skill in the art.

The operating system 132-142 of the computing devices 130-140 includes socket redirection modules 136-146 to redirect network messages. The computer applications 134-144 generate and maintain network connections directed to various remote servers (not illustrated). The remote servers can include applications, products or services such as social networking applications that the users may interact with via the computer applications 134-144. Instead of directly opening and maintaining the network connections with these remote servers, the socket redirection modules 136-146 route all of the network messages for these connections of the computer applications 134-144 to the cloud server 110. The cloud server 110 is responsible for opening and maintaining network connections with the remote servers.

All or some of the network connections of the computing devices 130-140 are through the cloud server 110. The network connections can include Transmission Control Protocol (TCP) connections, User Datagram Protocol (UDP) connections, or other types of network connections based on other protocols. When there are multiple computer applications 134-144 that need network connections to multiple remote servers, the computing devices 130-140 only need to maintain one network connection with the cloud server 110. The cloud server 110 will in turn maintain multiple connections with the remote servers on behalf of the computer applications 134-144.

In various embodiments, the cloud server 110 maintains a certain level of durability of the data files stored at the storage system 105 by regulating a number of copies of the data files stored at the storage system 105. In some embodiments, the cloud server 110 regulates the number of copies of the data files based on the popularity values of the data files. For example, the more popular the data files are among the users, the fewer the number of copies of the data files stored at the storage system. Additional details with respect to regulating the number of copies of the data files based on the popularity values are described at least with reference to FIGS. 2-7.

FIG. 2 illustrates an example system that regulates a number of copies of data files stored at a storage system based on a popularity value of the corresponding data file, consistent with various embodiments. In some embodiments, the system 200 can be similar to a system such as system 100 of FIG. 1. In some embodiments, the server 230 is similar to the cloud server 110 and the storage system 235 can be similar to storage system 105. In the figure, the storage system 235 a is a non-regulated data storage, and storage systems 235 b-d are examples of regulated storage systems in which the number of copies of data files is regulated based on the popularity of data files. In an embodiment, the storage systems 235 a-d form a storage system 235 of the system 200.

The server 230 provides data storage services to a number of users, including a first user, a second user and a third user to store various data files. The data files can include files such as images, videos, logs, application configuration files, computing device configuration files, etc. A user can upload data files from one or more computing devices associated with the user to the server 230 via a communication network 225. For example, a first user can upload a data file, “File A,” from a first computing device 205 and a second computing device 210. Similarly, the third computing device 215 uploads data files “File A” and “File B,” and the fourth computing device 220 uploads “File A,” “File B” and “File C.” Accordingly, the server 230 stores four copies of “File A” in the storage system 235, two copies of “File B” and one copy of “File C.” The storage system 235 a can have a number of storage units across which the data files can be stored. Further, in some embodiments, the storage units can be spread across various geographical locations.

Typically, a storage system keeps a number of copies of the data files in order to improve the durability of data files, e.g., to minimize the impact due to data loss either at the user end or at the storage system end. In some embodiments, the server 230 maintains a certain level of durability of the data files stored at the storage system 235 a by regulating a number of copies of the data files stored at the storage system 235 a based on the popularity of the data files. The more popular the data files are among the users, the fewer copies of the data files are stored at the storage system.

The popularity of a particular data file is measured using a popularity value. In some embodiments, the popularity value of the particular data file is determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other and an overall popularity value of the particular data file can be determined as a function of the popularity value for one or more of the above factors.
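
As one illustration of how weighted factors might be combined, the following minimal Python sketch computes an overall popularity percentage as a weighted average. It assumes each factor has already been reduced to a score between 0 and 1. The function name, the factor names, the default weight of 1.0, and the choice of a weighted average are all hypothetical; the disclosure leaves the combining function open.

```python
def overall_popularity(factor_scores, weights):
    """Combine per-factor scores (each assumed to be in [0, 1]) into a percentage.

    factor_scores and weights are dicts keyed by a factor name. A factor with
    no configured weight is given weight 1.0 (an assumption of this sketch).
    """
    total_weight = sum(weights.get(name, 1.0) for name in factor_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        score * weights.get(name, 1.0) for name, score in factor_scores.items()
    )
    return 100.0 * weighted_sum / total_weight


# Hypothetical factor scores for one data file; device count is weighted double.
scores = {"device_count": 1.0, "latency": 0.5, "availability": 0.5}
weights = {"device_count": 2.0, "latency": 1.0, "availability": 1.0}
popularity = overall_popularity(scores, weights)  # 75.0
```

A weighted average keeps the result within the 0% to 100% range used in the percentage example, which is why it is used here; any other function of the factor values would fit the description equally well.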

In some embodiments, the higher the number of computing devices that contain the particular data file, the higher the popularity value of the particular data file. This may indicate that since the particular data file is available from many computing devices, a lesser number of copies, including zero, may be stored at the storage system 235. When a user requests to retrieve the particular data file, the server obtains the particular data file from one of the computing devices and serves the particular data file to the user.

In some embodiments, the higher the latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, the lower the popularity value of the particular data file. In some embodiments, if the latency is above a maximum acceptable value, the server may determine to store a higher number of copies at the storage system 235. In some embodiments, an overall latency-based popularity value may be determined as an average of, or as any other function of, the latency-based popularity values of the particular data file for each of the computing devices that contain the particular data file.

In some embodiments, the higher the network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, the higher the popularity value. In some embodiments, an overall network-bandwidth-based popularity value may be determined as an average of, or as any other function of, the network-bandwidth-based popularity values of the particular data file for each of the computing devices that contain the particular data file.

In some embodiments, the higher the availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, the higher the popularity value of the particular data file. In some embodiments, an overall network-connection-availability-based popularity value may be determined as an average of, or as any other function of, the network-connection-availability-based popularity values of the particular data file for each of the computing devices that contain the particular data file.
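
The per-device averaging described above for latency, network bandwidth and connection availability can be sketched as follows, using latency as the example. The linear mapping from latency to a score, the millisecond units, and both function names are assumptions made for illustration only; the disclosure permits any function.

```python
def latency_score(latency_ms, max_acceptable_ms):
    """Map a read latency to a score in [0, 1]; higher latency gives a lower score.

    A latency at or above the maximum acceptable value scores 0. The linear
    shape is an assumption of this sketch.
    """
    if latency_ms >= max_acceptable_ms:
        return 0.0
    return 1.0 - latency_ms / max_acceptable_ms


def average_score(per_device_scores):
    """Average the per-device scores; no devices means a score of 0."""
    if not per_device_scores:
        return 0.0
    return sum(per_device_scores) / len(per_device_scores)


# Hypothetical latencies (ms) for two devices that hold the file.
device_scores = [latency_score(ms, 1000) for ms in (250, 750)]
overall_latency_score = average_score(device_scores)  # 0.5
```

The same `average_score` helper could be reused for bandwidth-based or availability-based per-device scores once each is normalized to the same range.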

In some embodiments, the higher the number of the users requiring storage for the same data file at the storage system, the higher the popularity value of the particular data file.

In some embodiments, the access pattern of the particular data file is considered for determining the popularity value. The access pattern can be based on how frequently the particular data file stored at the storage system 235 a is accessed or requested by a user who has uploaded the particular data file. The higher the frequency of access, the higher the number of copies stored at the storage system. If the frequency of access is high, the server 230 may determine to store one or more copies on the storage system since it may be faster and more efficient to retrieve the data file from the storage system rather than the computing devices of the users that contain the copy of the particular data file. Accordingly, the higher the frequency of access, the lower the popularity value. Further, in some embodiments, the access pattern of the particular data file may be considered not only for a particular user but also for a subset of the users.

The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value, where a popularity value of 100% can indicate that all the users serviced by the storage system have a copy of the particular data file on all their computing devices, the particular data file can be fetched from any of the computing devices with a minimum latency, the particular data file is accessed frequently, etc. On the other hand, a popularity value of 0% can indicate that none of the users have a copy of the particular data file or it is not possible to retrieve a copy within a maximum accepted latency, etc.

The number of copies stored in the storage system is increased or decreased, including from/to zero, based on the popularity value. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Generally, the higher the popularity, the lower the number of copies of the data file stored at the storage system. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server.
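
A configurable mapping from popularity value ranges to copy counts might look like the following Python sketch. The thresholds and copy counts in `COPY_RULES` are invented for illustration, as is the function name; an administrator would supply the actual ranges.

```python
# Hypothetical administrator configuration: each entry is
# (lower bound of a popularity range in percent, copies to keep),
# ordered from the highest range to the lowest.
COPY_RULES = [(90.0, 0), (50.0, 1), (10.0, 2), (0.0, 3)]


def copies_for_popularity(popularity_pct, rules=COPY_RULES):
    """Return the number of copies to keep for a popularity percentage."""
    for lower_bound, copies in rules:
        if popularity_pct >= lower_bound:
            return copies
    # Below every configured bound: fall back to the lowest range.
    return rules[-1][1]


copies_at_full_popularity = copies_for_popularity(100.0)  # 0
copies_at_zero_popularity = copies_for_popularity(0.0)    # 3
```

With these example rules the copy count never increases as popularity rises, matching the general relationship stated above.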

Referring back to the non-regulated storage system 235 a, the storage system 235 a includes four copies of “File A,” two copies of “File B” and a copy of “File C.” The server 230 may adjust the number of copies of the above mentioned data files in one or more of the following ways:

Regarding “File A,” the server 230 may determine that “File A” has a high popularity value, e.g., because each of the four computing devices has a copy of “File A,” the availability of network connection with one or more of the computing devices is high, etc. Accordingly, the server 230 may decrease the number of copies of “File A” by half as shown in regulated storage systems 235 b-c. In some embodiments, the server 230 may even determine not to store any copy of “File A” in the storage system as shown by example storage system 235 d.

Regarding “File B,” the server 230 may determine to retain the same number of copies based on the popularity value of “File B.” Regarding “File C,” in some embodiments, the popularity value may indicate that one copy of “File C” is sufficient to be stored at the storage system, e.g., because only one computing device needs the file, the file is not accessed as frequently, etc. Accordingly, the server 230 stores only one copy of “File C” as shown in the example storage system 235 b. However, in some embodiments, the popularity value of “File C” may change even with just one user, e.g., if the user is travelling, the network connectivity between the fourth computing device 220 and the storage unit in the storage system 235 a that contains the copy of “File C” may change when the user is at another geographical location. The popularity value of “File C” can change and therefore can have an effect on the number of copies stored at the storage system. The popularity value may indicate that two copies of the file be maintained at the storage system. Accordingly, the server 230 may add another copy of “File C” at the storage system as shown in regulated storage systems 235 c-d. In some embodiments, the server 230 may add another copy of “File C” in the storage unit of storage systems 235 c-d that is closer to the location where the user has travelled to.
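
The adjustments described for “File A,” “File B” and “File C” amount to comparing the current copy count with a target count and adding or removing the difference. The sketch below uses the counts of the non-regulated storage system 235 a as the starting point and the counts of regulated storage system 235 c as targets. The function name and the tuple return format are hypothetical.

```python
def plan_adjustment(current_copies, target_copies):
    """Return an (action, count) pair describing how to reach the target."""
    if target_copies > current_copies:
        return ("add", target_copies - current_copies)
    if target_copies < current_copies:
        return ("remove", current_copies - target_copies)
    return ("keep", 0)


# Copy counts before regulation (235 a) and after regulation (235 c).
current = {"File A": 4, "File B": 2, "File C": 1}
target = {"File A": 2, "File B": 2, "File C": 2}
plan = {name: plan_adjustment(current[name], target[name]) for name in current}
```

Here `plan` halves the copies of “File A,” leaves “File B” unchanged, and adds one copy of “File C.”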

In some embodiments, the server 230 determines whether various data files uploaded by different users are similar by using various file comparison techniques such as checksum, hash sum, etc. The server 230 generates a checksum for each of the files uploaded to the server 230 for further storage at storage system 235 and stores the checksum of each of the data files in the storage system 235 or in another storage system separate from the storage system 235. The checksums may be calculated for a portion of the data file, e.g., a block of a file or a segment of a file that has a plurality of blocks, or a complete data file. Further, the server 230 also stores the identifications of at least one of the user and the computing device which uploaded a particular data file. In some embodiments, the checksums and the identifications of the users and/or computing devices are stored in a data file availability table (not illustrated). The server 230 may use the data file availability table in determining the popularity value and also in determining which of the computing devices has a particular data file.
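
A data file availability table keyed by checksum could be sketched as below. SHA-256 from Python's standard `hashlib` module is an assumed choice of checksum, and the class name, method names and identifiers are hypothetical; the disclosure does not name a specific algorithm or table layout.

```python
import hashlib
from collections import defaultdict


class AvailabilityTable:
    """Maps a file checksum to the (user, device) pairs that uploaded it."""

    def __init__(self):
        self._holders_by_checksum = defaultdict(set)

    @staticmethod
    def checksum(data: bytes) -> str:
        # SHA-256 is an assumption; the text only says "checksum, hash sum".
        return hashlib.sha256(data).hexdigest()

    def record_upload(self, data: bytes, user_id: str, device_id: str) -> str:
        """Record that a device uploaded this content and return its checksum."""
        digest = self.checksum(data)
        self._holders_by_checksum[digest].add((user_id, device_id))
        return digest

    def holders(self, digest: str) -> set:
        """Return a copy of the set of (user, device) pairs holding the file."""
        return set(self._holders_by_checksum.get(digest, set()))


table = AvailabilityTable()
first = table.record_upload(b"File A contents", "user-1", "device-205")
second = table.record_upload(b"File A contents", "user-2", "device-215")
```

Because identical content yields the same checksum, `first` equals `second`, and the number of entries returned by `table.holders(first)` can feed the device-count factor of the popularity value. The same approach applies to a block or segment of a file by hashing only that portion.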

In some embodiments, the server 230 can use various storage techniques to store data efficiently. One example storage technique can include compression of data files so that the space consumed by each data file is minimized. The computing devices can include devices such as a smart phone, a digital media player, a laptop, a desktop, a tablet PC, etc.

FIG. 3 illustrates an example of a system 300 for serving a particular data file from the storage system, consistent with various embodiments. In some embodiments, the system 300 can be similar to the system 200 of FIG. 2, server 330 can be similar to the server 230, the computing devices 305-320 can be similar to the computing devices 205-220, respectively, and the storage system 350 can be similar to the storage system 235 d. The users associated with the first computing device 305, the second computing device 310, the third computing device 315 and the fourth computing device 320 have uploaded one or more of data files “File A,” “File B” and “File C” to the server 330 for storage as illustrated with reference to FIG. 2.

The server 330 has adjusted the number of copies of the data files stored at storage system 350. For example, while the server 330 has stored two copies each of “File B” and “File C,” no copies of “File A” are stored at the storage system 350, e.g., because “File A” has a high popularity value due to being available from a number of computing devices.

A computing device such as the third computing device 315 requests the server 330 to retrieve “File A” that it had uploaded earlier. The server 330 determines whether the storage system 350 has a copy of “File A.” If the storage system 350 has a copy of “File A,” then the server 330 obtains the data file from the storage system 350 and serves the data file to the third computing device 315. On the other hand, if the storage system 350 does not have a copy of “File A,” the server 330 determines which of the computing devices has a copy of “File A.” In some embodiments, the availability table 325 includes data specifying which of the computing devices has which of the data files and also data specifying other attributes such as network bandwidth for the computing devices, their network connection availability, associated latency to obtain the data file, etc.

The server 330 checks the availability table 325 to determine which of the computing devices has a copy of “File A” and identifies a particular computing device from which it can retrieve a copy of “File A.” In some embodiments, the server 330 may select a computing device, e.g., first computing device 305, from which the copy of “File A” can be retrieved with the least amount of latency. The server 330 retrieves the copy of “File A” from the first computing device 305 and serves the data file, “File A,” to the third computing device 315. The third computing device 315 would not be aware of where the data file is retrieved from. From the perspective of the third computing device 315, the data file, “File A,” is retrieved from the storage system 350.
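
The serving flow described with reference to FIG. 3 can be sketched as a lookup in the storage system followed by a fallback to the lowest-latency device. In this sketch the storage system and availability table are plain dictionaries, the per-device tuple layout is invented, and `fetch_from_device` stands in for whatever transport the server would use; all of these are assumptions for illustration.

```python
def serve_file(file_id, storage, availability, fetch_from_device):
    """Return file contents from storage if present, else from a device.

    storage: dict mapping file_id -> bytes held by the storage system.
    availability: dict mapping file_id -> list of
        (device_id, latency_ms, is_online) tuples (hypothetical layout).
    fetch_from_device: callable(device_id, file_id) -> bytes.
    """
    if file_id in storage:
        return storage[file_id]
    # Consider only devices that are currently reachable.
    candidates = [
        (latency_ms, device_id)
        for device_id, latency_ms, is_online in availability.get(file_id, [])
        if is_online
    ]
    if not candidates:
        raise FileNotFoundError(file_id)
    _, best_device = min(candidates)  # lowest latency wins
    return fetch_from_device(best_device, file_id)
```

The caller receives the same bytes either way, which reflects the point that the requesting device is not aware of where the data file was retrieved from.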

FIG. 4 illustrates a block diagram of a server 400 that regulates the number of copies of the data files stored at the storage system based on the popularity values of the corresponding data files, consistent with various embodiments of the disclosed technique. In some embodiments, the server 400 can be similar to cloud server 110 of FIG. 1. The server 400 can be, e.g., a dedicated standalone server, or implemented in a cloud computing service. The server 400 includes a network component 410, a processor 420, a memory 430, a request receiving module 440, a popularity value determination module 450, a data file replication management module 460 and a data file serving module 470. The memory 430 can include instructions which, when executed by the processor 420, enable the server 400 to perform the functions as described with reference to cloud server 110. The network component 410 is configured for network communications with computing devices and remote servers (not illustrated). The network component 410 establishes a device network connection with a computing device, and a server network connection with the storage system 105 in response to a request from the computing device for connecting with the storage system 105. The request can be generated by a computer application running at the computing device.

As explained above, the server 400 facilitates storing of data files of the users at a storage system such as storage system 105. The data files can be received from one or more users and also from one or more computing devices of each of the users. For example, a user can be associated with multiple computing devices such as smartphones, digital media players, laptops, desktops, tablet PCs, etc. The data files can include files such as images, videos, logs, application configuration files, computing device configuration files, etc. The server 400 maintains a certain level of durability of the data files stored at the storage system 105 by regulating a number of copies of the data files stored at the storage system 105. In some embodiments, the more popular the data files are among the users, the lower the number of copies of the data files stored at the storage system 105.

The popularity of a particular data file is measured using a popularity value. The popularity value determination module 450 determines the popularity of the data file based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other. The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value.
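As one hypothetical way of weighting such factors relative to each other, the sketch below combines normalized factor scores into a percentage. The disclosure does not prescribe an equation; the weighted average, the factor names, the normalization of each factor to the range 0 to 1, and the function name `popularity_value` are all assumptions made for illustration.

```python
def popularity_value(factors, weights):
    """Return a popularity value in percent from normalized factor scores.

    factors: dict mapping a factor name to a score in [0.0, 1.0]
             (1.0 meaning most favorable for reading the file from devices)
    weights: dict mapping the same names to non-negative relative weights

    Assumption: a weighted average is used; other equations are possible.
    """
    for name, score in factors.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"factor {name!r} must be within [0, 1]")
    total_weight = sum(weights[name] for name in factors)
    if total_weight <= 0:
        raise ValueError("at least one factor needs a positive weight")
    weighted = sum(weights[name] * score for name, score in factors.items())
    return 100.0 * weighted / total_weight


# Hypothetical scores: every device holds the file, half are reachable.
value = popularity_value(
    {"device_share": 1.0, "availability": 0.5},
    {"device_share": 1.0, "availability": 1.0},
)
```

With equal weights the two scores average to a popularity value of 75 percent; raising the weight of `device_share` would move the result toward 100 percent.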

The data file replication management module 460 determines the number of copies to be maintained at the storage system 105 for a particular data file. Generally, the higher the popularity of the data file, the lower the number of copies of the data file stored at the storage system 105. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server. Further, in some embodiments, the data file replication management module 460 also maintains an availability table that includes data specifying which of the computing devices has copies of which of the data files, and also includes data specifying other attributes such as network bandwidth for the computing devices, their network connection availability, associated latency to obtain the data file, etc.

The request receiving module 440 receives requests from the users for storing data files at, or retrieving data files from, the storage system 105. In some embodiments, the request receiving module 440 receives the request via the network component 410 that facilitates communication with the computing devices of the users.

The data file serving module 470 responds to the requests from a user for retrieving the data files from the storage system 105 by retrieving the data file and serving it to the user. The data file serving module 470 serves the data file by retrieving the data file either from the storage system 105 or, if the storage system does not have the requested data file, from one of the computing devices. In some embodiments, the data file serving module 470 checks with the availability table to determine which of the computing devices has a copy of the requested data file, and retrieves the copy of the data file from one of the identified computing devices.

FIG. 5 illustrates a flow diagram for regulating data storage at a storage system based on a popularity value of data files, consistent with various embodiments. The process 500 may be executed in a system such as system 100 of FIG. 1. At step 505, the server 110 receives a request to store a data file from one or more users. In some embodiments, if more than one user uploads the same data file, multiple copies of the data file are created. At step 510, the server 110 stores the multiple copies of the data file at the storage system 105.

At step 515, the server 110 determines a popularity of the data file. In some embodiments, the popularity of a data file is measured using a popularity value. The popularity value of the data file is determined based on a number of factors, including one or more of: (a) a number of computing devices associated with one or more of the users that contain the particular data file, (b) a latency associated with reading the particular data file from one or more of the computing devices that contain the particular data file, (c) a network bandwidth available for reading the particular data file from one or more of the computing devices that contain the particular data file, (d) availability of a network connection with one or more of the computing devices that contain the particular data file for reading the particular data file, (e) a number of the users requiring storage for the same data file at the storage system, or (f) an access pattern of the particular data file for a specific user or a subset of the users. In some embodiments, one or more of the above factors can be weighted relative to each other. The popularity value can be determined in various units and using various mathematical equations. One example expression of a popularity value can include a percentage value.

At step 520, the server 110 determines a number of copies of the data file to be stored at the storage system 105 based on the popularity value. In some embodiments, the higher the popularity of the data file, the lower the number of copies of the data file stored at the storage system. For example, if the popularity value of a particular data file is 100%, the storage system may not store any copies of the particular data file since the particular data file is available at all the computing devices of the users and can be retrieved from any of the computing devices at any time. On the other hand, if the popularity value of a particular data file is 0%, the storage system may store one or more copies of the particular data file since the particular data file is not available at any of the computing devices or cannot be retrieved within a maximum accepted latency, etc. Further, various popularity value ranges and the number of copies that can be stored for each of the ranges can be configured, e.g., by an entity such as an administrator of the storage server. For example, a popularity range of 0-9% may have 5 copies, 10-40% may have 4 copies, 41-70% may have 3 copies, 71-95% may have 2 copies and 96-100% may have 0 (zero) copies.
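The example ranges above can be expressed as a simple lookup, sketched below under stated assumptions. The names `UPPER_BOUNDS`, `COPIES` and `copies_for_popularity` are hypothetical, and the handling of fractional values that fall between two listed ranges (e.g., 9.5%), which the example does not address, is an assumption: such values are assigned to the next higher range.

```python
from bisect import bisect_left

# Inclusive upper bound of each example range and its configured copy count:
# 0-9% -> 5, 10-40% -> 4, 41-70% -> 3, 71-95% -> 2, 96-100% -> 0.
UPPER_BOUNDS = [9, 40, 70, 95, 100]
COPIES = [5, 4, 3, 2, 0]


def copies_for_popularity(popularity_percent):
    """Map a popularity value in percent to the number of copies to store."""
    if not 0 <= popularity_percent <= 100:
        raise ValueError("popularity must be between 0 and 100 percent")
    # bisect_left finds the first range whose upper bound is >= the value.
    return COPIES[bisect_left(UPPER_BOUNDS, popularity_percent)]
```

An administrator-configurable variant could load the two lists from configuration instead of defining them as constants.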

At step 525, the server 110 regulates or adjusts the number of copies of the data file at the storage system by at least one of: (a) not storing any copy of the data file at the storage system if the popularity value exceeds a first threshold, (b) increasing the number of copies stored at the storage system if the popularity value is below a second threshold, or (c) decreasing the number of copies stored at the storage system if the popularity value exceeds a third threshold. In some embodiments, the number of copies of a data file can be regulated either for a complete data file or for a portion of the data file.
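A minimal sketch of the three adjustments of step 525 follows. The threshold values, the step size of one copy, the upper limit `max_copies`, and the choice to keep at least one copy when merely decreasing under (c) are assumptions chosen for illustration; the disclosure does not fix any of them.

```python
def regulate_copies(current_copies, popularity_percent,
                    first_threshold=95.0,   # above this: store no copy
                    second_threshold=10.0,  # below this: add a copy
                    third_threshold=70.0,   # above this: remove a copy
                    max_copies=5):
    """Return the adjusted number of copies for one data file.

    All default values are hypothetical and would normally be configured.
    """
    if popularity_percent > first_threshold:
        return 0                                   # case (a)
    if popularity_percent < second_threshold:
        return min(current_copies + 1, max_copies)  # case (b)
    if popularity_percent > third_threshold:
        # case (c); assumption: never drop below one copy in this branch
        return max(current_copies - 1, 1)
    return current_copies                          # no adjustment needed
```

For example, a data file with three stored copies would drop to two copies at a popularity value of 80%, rise to four copies at 5%, and have no stored copy at 98%.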

FIG. 6 illustrates an example process for serving a particular data file from the storage system, consistent with various embodiments. In an embodiment, the process 600 may be implemented in a system such as system 100 of FIG. 1. At step 605, the server 110 receives a request from a user to retrieve a data file of the user from a storage system. In some embodiments, the user can have one or more computing devices associated with the user. The user may make the request using any of the computing devices. At step 610, the server 110 determines whether the storage system 105 has an entire copy of the requested data file. Responsive to a determination that the storage system 105 has the entire copy of the requested data file, at step 645, the server 110 serves the copy of the requested data file to the user from the storage system 105.

On the other hand, responsive to a determination that the storage system 105 does not have a copy of the entire data file, at step 615, the server 110 determines whether the storage system 105 has a portion of the requested data file. Responsive to a determination that the storage system has a portion of the requested file, e.g., a first block or segment, etc., at step 620, the server 110 determines which of the computing devices of other users have a copy of the remaining portions of the requested data file. In some embodiments, the server checks with the availability table to determine which of the computing devices has a copy of the requested data file.

In some embodiments, the server 110 generates a checksum for each of the data files uploaded by the users to the server 110 for storage of the data files. The checksums may be calculated for a portion of the data file, e.g., a block of a file or a segment of a file that has a plurality of blocks, or a complete data file. The server 110 stores the checksums of the data files in the availability table. In some embodiments, the server 110 also stores other attributes such as the names of the data files, identifications of the computing devices from which the data files are uploaded, a network bandwidth available for reading the copy of files from the corresponding computing devices, a network availability for connecting with the corresponding computing devices, associated latency, etc. Some of the foregoing attributes may be updated periodically.

In some embodiments, the server 110 compares a checksum of the requested data file with the stored checksums of the data files to determine if any of the computing devices has the copy of the requested data file. The server 110 chooses one of the computing devices from which to retrieve a copy of the requested data file based on a predefined criterion. For example, the server 110 can choose a computing device from which the copy of the data file can be read with the least latency.
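The checksum-based lookup described above might be sketched as follows. The disclosure does not name a checksum algorithm; SHA-256 from the Python standard library is used here as an assumption, and the class `AvailabilityTable`, its methods, and the device identifiers are hypothetical names for this illustration only.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of a complete file or of one block (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class AvailabilityTable:
    """Hypothetical table: checksum -> {device id: read latency in ms}."""

    def __init__(self):
        self._by_checksum = {}

    def record_upload(self, data: bytes, device_id: str, latency_ms: float) -> str:
        """Register that device_id holds the content and return its checksum."""
        digest = checksum(data)
        self._by_checksum.setdefault(digest, {})[device_id] = latency_ms
        return digest

    def best_device(self, digest: str):
        """Return the device holding the content with least latency, or None."""
        devices = self._by_checksum.get(digest)
        if not devices:
            return None
        return min(devices, key=devices.get)


table = AvailabilityTable()
digest_a = table.record_upload(b"contents of File A", "device_305", 20.0)
table.record_upload(b"contents of File A", "device_310", 50.0)
chosen = table.best_device(digest_a)
```

Because identical content yields an identical checksum, both uploads land in the same table entry, and the lookup selects the device with the lower recorded latency.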

At step 625, the server 110 retrieves the copy of the remaining portions of the data file from one of the computing devices. At step 630, the server 110 generates an entire copy of the requested data file using the portions retrieved from the identified computing device and the storage system 105. The server 110 can use various file joining techniques for generating a file using various portions of the file. At step 645, the server 110 serves the copy of the data file to the user.

Referring back to step 615, responsive to a determination that the storage system does not have a portion of the requested file, at step 635, the server 110 determines which of the computing devices of other users have a copy of the entire requested data file. At step 640, the server 110 retrieves the copy of the entire data file from one of the computing devices and, at step 645, the server 110 serves the copy of the data file to the user.
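The branches of process 600 can be summarized in the following sketch. It assumes, for illustration only, that a data file is split into a known number of blocks, that the storage system is modeled as a dictionary of the blocks it holds, and that the first listed device copy is the one already selected by the predefined criterion; the function name `serve_file` and the data layout are hypothetical.

```python
def serve_file(file_id, block_count, storage_blocks, device_copies):
    """Assemble the requested file from the storage system and/or a device.

    storage_blocks: dict file_id -> {block index: bytes} held at the storage system
    device_copies:  dict file_id -> list of complete copies (each a list of
                    block bytes), ordered by preference
    """
    stored = storage_blocks.get(file_id, {})
    # Step 610: the storage system has the entire file; serve it (step 645).
    if len(stored) == block_count:
        return b"".join(stored[i] for i in range(block_count))
    copies = device_copies.get(file_id)
    if not copies:
        raise LookupError(f"no source available for {file_id!r}")
    source = copies[0]
    # Steps 615-630: use stored blocks where present and fetch the remaining
    # ones from the device. Steps 635-640: with nothing stored, every block
    # comes from the device.
    return b"".join(stored.get(i, source[i]) for i in range(block_count))


devices = {"File A": [[b"he", b"llo"]]}
partial = serve_file("File A", 2, {"File A": {0: b"he"}}, devices)
absent = serve_file("File A", 2, {}, devices)
```

Simple concatenation stands in here for the file joining techniques mentioned above; the requesting user receives the same bytes in every branch.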

Regardless of whether the data file is retrieved from the storage system 105 or from the computing devices of the users, from the perspective of the user who requested the data file, the user sees the data file as being served from the storage system 105. The user may be unaware of the fact that the data file is retrieved from a computing device of another user.

FIG. 7 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology. The computing system 700 may be used to implement any of the entities, components or services depicted in the examples of FIGS. 1-6 (and any other components described in this specification). The computing system 700 may include one or more central processing units (“processors”) 705, memory 710, input/output devices 725 (e.g., keyboard and pointing devices, display devices), storage devices 720 (e.g., disk drives), and network adapters 730 (e.g., network interfaces) that are connected to an interconnect 715. The interconnect 715 is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect 715, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called “Firewire”.

The memory 710 and storage devices 720 are computer-readable storage media that may store instructions that implement at least portions of the described technology. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media can include computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.

The instructions stored in memory 710 can be implemented as software and/or firmware to program the processor(s) 705 to carry out actions described above. In some embodiments, such software or firmware may be initially provided to the computing system 700 by downloading it from a remote system through the computing system 700 (e.g., via network adapter 730).

The technology introduced herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

Remarks

The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in certain instances, well-known details are not described in order to avoid obscuring the description.

Further, various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, certain terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way. One will recognize that “memory” is one form of a “storage” and that the terms may on occasion be used interchangeably.

Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.

Those skilled in the art will appreciate that the logic illustrated in each of the flow diagrams discussed above may be altered in various ways. For example, the order of the logic may be rearranged, substeps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc.

Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.

What is claimed:
1. A method of regulating a data storage service, the method comprising: receiving, at a server, a data file from one or more computing devices; storing, by the server, the data file at a storage system provided by the data storage service; determining, by the server, a number of copies of the data file to be stored at the storage system based on a number of a set of computing devices that store the data file, the set of computing devices being outside of the storage system, wherein the determined number of copies of the data file to be stored at the storage system decreases when the number of the set of computing devices that store the data file increases; and adjusting, by the server, the number of copies of the data file stored at the storage system based on the determined number of copies of the data file.
2. The method of claim 1, wherein the one or more computing devices are associated with one or more users.
3. The method of claim 2, wherein the number of copies of the data file to be stored at the storage system is determined further based on a number of the one or more users associated with the one or more computing devices that send the data file to the server to be stored at the storage system.
4. The method of claim 1, wherein the number of copies of the data file include copies of a portion of the data file.
5. The method of claim 1, further comprising determining a value of the data file as a function of the number of the set of computing devices that store the data file, wherein the number of copies of the data file to be stored at the storage system is determined based on the value of the data file.
6. The method of claim 5, wherein adjusting the number of copies of the data file stored at the storage system includes increasing the number of copies of the data file stored at the storage system according to a value range to which the value of the data file corresponds.
7. The method of claim 5, wherein adjusting the number of copies of the data file stored at the storage system includes decreasing the number of copies of the data file stored at the storage system according to a value range to which the value of the data file corresponds.
8. The method of claim 5, wherein the determining of the value of the data file comprises determining the value as a function of a latency associated with reading the data file from one or more of the set of computing devices that store the data file.
9. The method of claim 5, wherein the determining of the value of the data file comprises determining the value as a function of a network bandwidth available for reading the data file from one or more of the set of computing devices that store the data file.
10. The method of claim 5, wherein the determining of the value of the data file comprises determining the value as a function of availability of a network connection with one or more of the set of computing devices that store the data file for reading the data file.
11. The method of claim 5, wherein the determining of the value of the data file comprises determining the value as a function of access pattern of the data file for a specific user.
12. The method of claim 5, wherein the determining of the value of the data file comprises determining the value as a function of access pattern of the data file for a subset of one or more users associated with the one or more computing devices.
13. An apparatus for regulating a data storage service, the apparatus comprising: a memory; and at least one processor coupled to the memory and configured to: receive a data file from one or more computing devices; store the data file at a storage system provided by the data storage service; determine a number of copies of the data file to be stored at the storage system based on a number of a set of computing devices that store the data file, the set of computing devices being outside of the storage system, wherein the determined number of copies of the data file to be stored at the storage system decreases when the number of the set of computing devices that store the data file increases; and adjust the number of copies of the data file stored at the storage system based on the determined number of copies of the data file.
14. The apparatus of claim 13, wherein the one or more computing devices are associated with one or more users, wherein the number of copies of the data file to be stored at the storage system is determined further based on a number of the one or more users associated with the one or more computing devices that send the data file to the apparatus to be stored at the storage system.
15. The apparatus of claim 13, wherein the at least one processor is further configured to determine a value of the data file as a function of the number of the set of computing devices that store the data file, wherein the number of copies of the data file to be stored at the storage system is determined based on the value of the data file.
16. The apparatus of claim 15, wherein, to determine the value of the data file, the at least one processor is configured to determine the value as a function of a latency associated with reading the data file from one or more of the set of computing devices that store the data file.
17. The apparatus of claim 15, wherein, to determine the value of the data file, the at least one processor is configured to determine the value as a function of a network bandwidth available for reading the data file from one or more of the set of computing devices that store the data file.
18. The apparatus of claim 15, wherein, to determine the value of the data file, the at least one processor is configured to determine the value as a function of availability of a network connection with one or more of the set of computing devices that store the data file for reading the data file.
19. The apparatus of claim 15, wherein, to determine the value of the data file, the at least one processor is configured to determine the value as a function of access pattern of the data file for a specific user.
20. The apparatus of claim 15, wherein, to determine the value of the data file, the at least one processor is configured to determine the value as a function of access pattern of the data file for a subset of one or more users associated with the one or more computing devices.